Abstract

This paper addresses issues in automated 
treebank construction. We show how stan- 
dard part-of-speech tagging techniques ex- 
tend to the more general problem of struc- 
tural annotation, especially for determin- 
ing grammatical functions and syntactic 
categories. Annotation is viewed as an in- 
teractive process where manual and auto- 
matic processing alternate. Efficiency and 
accuracy results are presented. We also dis- 
cuss further automation steps. 



1 Introduction 

The aim of the work reported here is to construct 
a corpus of German annotated with syntactic struc- 
tures (treebank). The required size of the treebank 
and granularity of encoded information make it nec- 
essary to ensure high annotation efficiency and accu- 
racy. Annotation automation has thus become one 
of the central issues of the project. 

In this section, we discuss the relation between au- 
tomatic and manual annotation. Section 2 focuses 
on the annotation format employed in our treebank. 
The annotation software is presented in section 3. 
Sections 4 and 5 deal with automatic assignment of 
grammatical functions and phrasal categories. Ex- 
periments on automating the annotation are pre- 
sented in section 6. 

1.1 Automatic vs. Manual Annotation 

A problem for corpus annotation is the trade-off be- 
tween efficiency, accuracy and coverage. Although 
accuracy increases significantly as annotators gain 
expertise, incorrect hand-parses still occur. Their 
frequency depends on the granularity of the encoded 
information. 

Due to this residual error rate, automatic annotation of frequently
occurring phenomena is likely to yield better results than even
well-trained human annotators. For infrequently occurring
constructions, however, manual annotation is more reliable, as is
manual annotation of phenomena involving non-syntactic information
(e.g., resolution of attachment ambiguities based on world knowledge).

As a consequence, efficiency and reliability of an- 
notation can be significantly increased by combining 
automatic annotation with human processing skills 
and supervision, especially if this combination is im- 
plemented as an interactive process. 

2 Annotation Scheme 

Existing treebanks of English ((Marcus et al., 1994), 
(Sampson, 1995), (Black et al., 1996)) contain con- 
ventional phrase-structure trees augmented with an- 
notations for discontinuous constituents. As this en- 
coding strategy is not well-suited to a free word or- 
der language like German, we have focussed on a less 
surface-oriented level of description, most closely re- 
lated to the LFG f-structure, and representations 
used in dependency grammar. To avoid confusion 
with theory-specific constructs, we use the generic 
term argument structure to refer to our annotation 
format. The main advantages of the model are: it is 
relatively theory-independent and closely related to 
semantics. For more details on the linguistic specifications of the
annotation scheme see (Skut et al., 1997). A similar approach has also
been successfully applied in the TSNLP database, cf. (Lehmann et al.,
1996).

In contrast to conventional phrase-structure grammars, argument
structure annotations are not influenced by word order. Local and
non-local dependencies are represented in the same way, the latter
indicated by crossing branches in the hierarchical structure, as shown
in figure 1, where in the VP the terminals of the direct object OA (den
Traum von der kleinen Gaststätte) are not adjacent to the head HD
aufgegeben¹. For a related handling of non-projective phenomena see
(Tapanainen and Järvinen, 1997).

¹ See appendix A for a description of tags used throughout this paper.

The dream of the small inn has he yet not given up

'He has not yet given up the dream of a small inn.'

Figure 1: Example sentence

Such a representation permits clear separation of 
word order (in the surface string) and syntactic de- 
pendencies (in the hierarchical structure). Thus 
we avoid explicit explanatory statements about the 
complex interrelation between word order and syn- 
tactic structure in free word order languages. Such 
statements are generally theory-specific and there- 
fore are not appropriate for a descriptive approach 
to annotation. The relation between syntactic dependencies and surface
order can nonetheless be inferred from the data. This provides a
promising way of handling free word order phenomena².

3 Annotation Tool 

Since syntactic annotation of corpora is time- 
consuming, a partially automated annotation tool 
has been developed in order to increase efficiency. 

3.1 The User Interface 

For optimal human-machine interaction, the tool 
supports immediate graphical representation of the 
structure being annotated. 

Since keyboard input is most efficient for assigning categories to
words and phrases, cf. (Lehmann et al., 1996; Marcus et al., 1994), and
structural manipulations are executed most efficiently using the mouse,
both an elaborate keyboard interface and an optical interface are
provided. As suggested by Robert MacIntyre³, it is most efficient to
use one hand for structural commands with the mouse and the other hand
for short keyboard input.

² 'Free' word order is a function of several interacting parameters
such as category, case and topic-focus articulation. Varying the order
of words in a sentence yields a continuum of grammaticality judgments
rather than a simple right-wrong distinction.

³ Personal communication, Oct. 1996.

By additionally offering online menus for com- 
mands and labels, the tool suits beginners as well 
as experienced users. Commands such as "group 
words" , "group phrases" , "ungroup" , "change la- 
bels", "re-attach nodes", "generate postscript out- 
put" , etc. are available. 

The three tagsets (word, phrase, and edge labels) 
used by the annotation tool are variable. They are 
stored together with the corpus, which allows easy 
modification and exchange of tagsets. In addition, 
appropriateness checks are performed automatically. 
Comments can be added to structures. 

Figure 2 shows a screen dump of the graphical 
interface. 

3.2 Automating Annotation 

Existing treebank annotation tools are characterised 
by a high degree of automation. The task of the 
annotator is to correct the output of a parser, i.e., 
to eliminate wrong readings, complete partial parses, 
and adjust partially incorrect ones. 

Broad-coverage parsers for German, especially robust parsers that
assign predicate-argument structure and allow crossing branches, are
either not available or require an annotated training corpus (cf.
(Collins, 1996), (Eisner, 1996)).

As a consequence, we have adopted a bootstrap- 
ping approach, and gradually increased the degree 
of automation using already annotated sentences as 
training material for a stochastic processing module. 

This aspect of the work has led to a new model of human supervision.
Here automatic annotation and human supervision are combined
interactively whereby annotators are asked to confirm the local
predictions of the parser. The size of such 'supervision increments'
varies from local trees of depth one to larger chunks, depending on the
amount of training data available.

Figure 2: Screen dump of the annotation tool

We distinguish six degrees of automation: 

0) Completely manual annotation. 

1) The user determines phrase boundaries and 
syntactic categories (S, NP, VP, . . . ). The pro- 
gram automatically assigns grammatical func- 
tions. The annotator can alter the assigned tags 
(cf. figure 3). 

2) The user only determines the components of a 
new phrase (local tree of depth 1), while both 
category and function labels are assigned auto- 
matically. Again, the annotator has the option 
of altering the assigned tags (cf. figure 4) . 

3) The user selects a substring and a category, 
whereas the entire structure covering the sub- 
string is determined automatically (cf. figure 5). 



4) The program performs simple bracketing, i.e., 
finds 'kernel phrases' without the user having 
to explicitly mark phrase boundaries. The task 
can be performed by a chunk parser that is 
equipped with an appropriate finite state gram- 
mar (Abney, 1996). 

5) The program suggests partial or complete 
parses. 

A set of 500 manually annotated training sentences (step 0) was
sufficient for a statistical tagger to reliably assign grammatical
functions, provided the user determines the elements of a phrase and
its category (step 1). Approximately 700 additional sentences have been
annotated this way. Annotation efficiency increased by 25%, namely from
an average annotation time of 4 minutes to 3 minutes per sentence (300
to 400 words per hour). The 1,200 sentences were used to train the
tagger for automation step 2. Together with improvements in the user
interface, this increased the efficiency by another 33%, from
approximately 3 to 2 minutes per sentence (600 words per hour). The
fastest annotators cover up to 1000 words per hour.

'the bonus program for frequent fliers starting in 1993'

Figure 3: Example for automation level 1: the user has marked das, the
AP, Bonusprogramm, and the PP as a constituent of category NP, and the
tool's task is to determine the new edge labels (marked with question
marks), which are, from left to right, NK, NK, NK, MNR.

'the bonus program for frequent fliers starting in 1993'

Figure 4: Example for automation level 2: the user has marked das, the
AP, Bonusprogramm and the PP as a constituent, and the tool's task is
to determine the new node and edge labels (marked with question marks).

'the bonus program for frequent fliers starting in 1993'

Figure 5: Example for automation level 3: the user has marked the words
as a constituent, and the tool's task is to determine simple
sub-phrases (the AP and PP) as well as the new node and edge labels
(cf. previous figures for the resulting structure).
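The words-per-hour figures follow from the per-sentence times if one assumes an average sentence length of 20 words, which is the ratio implied by the approximate corpus figures given in section 6 (24,000 words in 1,200 sentences). A minimal sketch of that arithmetic, with names chosen here for illustration:

```python
# Assumption: 20 words per sentence, derived from the approximate corpus
# size reported in section 6 (24,000 words / 1,200 sentences).
WORDS_PER_SENTENCE = 24_000 / 1_200


def words_per_hour(minutes_per_sentence: float) -> float:
    """Convert an average annotation time per sentence into words per hour."""
    sentences_per_hour = 60 / minutes_per_sentence
    return sentences_per_hour * WORDS_PER_SENTENCE


for minutes in (4, 3, 2):
    print(f"{minutes} min/sentence: {words_per_hour(minutes):.0f} words/hour")
```

Under this assumption, 4, 3 and 2 minutes per sentence correspond to 300, 400 and 600 words per hour.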

At present, the treebank comprises 3000 sentences, each annotated
independently by two annotators. 1,200 of the sentences have been
compared with the corresponding second annotation and cleaned; the
remaining 1,800 are currently being cleaned.

In the following sections, the automation steps 1 
and 2 are presented in detail. 

4 Tagging Grammatical Functions 
4.1 The Tagger 

In contrast to a standard part-of-speech tagger, which estimates
lexical and contextual probabilities of tags from sequences of word-tag
pairs in a corpus (e.g. (Cutting et al., 1992; Feldweg, 1995)), the
tagger for grammatical functions works with lexical and contextual
probability measures $P_Q(\cdot)$ depending on the category of the
mother node ($Q$). Each phrasal category (S, VP, NP, PP etc.) is
represented by a different Markov model. The categories of the daughter
nodes correspond to the outputs of the Markov model, while grammatical
functions correspond to states.

The structure of a sample sentence is shown in figure 6. Figure 7 shows
those parts of the Markov models for sentences (S) and verb phrases
(VP) that represent the correct paths for the example.⁴

himself visited has Peter Sabine never

'Peter never visited Sabine himself'

Figure 6: Example sentence

Given a sequence of word and phrase categories $T = T_1 \ldots T_k$ and
a parent category $Q$, we calculate the sequence of grammatical
functions $G = G_1 \ldots G_k$ that link $T$ and $Q$ as

$$
\operatorname*{argmax}_{G} P_Q(G \mid T)
= \operatorname*{argmax}_{G} \frac{P_Q(G) \cdot P_Q(T \mid G)}{P_Q(T)}
= \operatorname*{argmax}_{G} P_Q(G) \cdot P_Q(T \mid G)
\tag{1}
$$

Assuming the Markov property we have

$$
P_Q(T \mid G) = \prod_{i=1}^{k} P_Q(T_i \mid G_i)
\tag{2}
$$

and

$$
P_Q(G) = \prod_{i=1}^{k} P_Q(G_i \mid C_i)
\tag{3}
$$

The contexts $C_i$ are modeled by a fixed number of surrounding
elements. Currently, we use two grammatical functions, which results in
a trigram model:

$$
P_Q(G) = \prod_{i=1}^{k} P_Q(G_i \mid G_{i-2}, G_{i-1})
\tag{4}
$$

⁴ cf. appendix A for a description of tags used in the example

Figure 7: Parts of the Markov models used in Selbst besucht hat Peter
Sabine nie (cf. figure 6). All unused states, transitions and outputs
are omitted.

The contexts are smoothed by linear interpolation 
of unigrams, bigrams, and trigrams. Their weights 
are calculated by deleted interpolation (Brown et al., 
1992). 
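As a rough illustration of the smoothing step, the contextual probability can be written as a weighted sum of unigram, bigram and trigram relative frequencies. The sketch below takes the three weights as given; estimating them by deleted interpolation is not shown. The padding symbol, the function names and the toy training sequences are assumptions made for this example only.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

START = "<s>"  # hypothetical padding symbol for the first two contexts


def train(sequences: Sequence[Sequence[str]]) -> Dict[str, Counter]:
    """Count n-grams of grammatical functions and how often each history occurs."""
    c = {name: Counter() for name in ("uni", "bi", "tri", "ctx1", "ctx2")}
    for seq in sequences:
        padded = [START, START, *seq]
        for i in range(2, len(padded)):
            a, b, g = padded[i - 2], padded[i - 1], padded[i]
            c["uni"][g] += 1
            c["bi"][(b, g)] += 1
            c["ctx1"][b] += 1
            c["tri"][(a, b, g)] += 1
            c["ctx2"][(a, b)] += 1
    return c


def interpolated_prob(g: str, prev2: str, prev1: str, c: Dict[str, Counter],
                      lambdas: Tuple[float, float, float]) -> float:
    """P(g | prev2, prev1) as a linear interpolation of 1-, 2- and 3-gram estimates."""
    l1, l2, l3 = lambdas
    total = sum(c["uni"].values())
    p_uni = c["uni"][g] / total if total else 0.0
    p_bi = c["bi"][(prev1, g)] / c["ctx1"][prev1] if c["ctx1"][prev1] else 0.0
    p_tri = (c["tri"][(prev2, prev1, g)] / c["ctx2"][(prev2, prev1)]
             if c["ctx2"][(prev2, prev1)] else 0.0)
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Toy function sequences for one phrase category (invented for illustration).
counts = train([["MO", "HD", "OA"], ["OA", "HD"], ["MO", "HD"]])
print(interpolated_prob("HD", START, "MO", counts, (0.1, 0.3, 0.6)))
```

When the weights sum to one and the history has been seen in training, the interpolated values over all functions again sum to one.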

The predictions of the tagger are correct in ap- 
prox. 94% of all cases. In section 4.3, we demon- 
strate how to cope with wrong predictions. 
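Equations (1) to (4) can be illustrated with a small decoder for a single phrase category. For clarity the sketch enumerates all function sequences exhaustively instead of using the Viterbi algorithm, so it is only practical for short daughter sequences. The probability tables are toy values invented for this example, not estimates from the treebank, and all names are hypothetical.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

START = "<s>"  # hypothetical padding symbol for the first two contexts

Trans = Dict[Tuple[str, str, str], float]  # P_Q(G_i | G_{i-2}, G_{i-1})
Emit = Dict[Tuple[str, str], float]        # P_Q(T_i | G_i)


def score(functions: Sequence[str], categories: Sequence[str],
          trans: Trans, emit: Emit) -> float:
    """P_Q(G) * P_Q(T|G) under the trigram assumption (equations 2 and 4)."""
    p = 1.0
    history = (START, START)
    for g, t in zip(functions, categories):
        p *= trans.get((history[0], history[1], g), 0.0)
        p *= emit.get((g, t), 0.0)
        history = (history[1], g)
    return p


def best_functions(categories: Sequence[str], states: Sequence[str],
                   trans: Trans, emit: Emit) -> Tuple[float, Tuple[str, ...]]:
    """Exhaustive argmax over all function sequences (equation 1)."""
    candidates = product(states, repeat=len(categories))
    return max((score(g, categories, trans, emit), g) for g in candidates)


# Toy VP model (all numbers invented for illustration).
vp_trans: Trans = {
    (START, START, "MO"): 0.5, (START, START, "OA"): 0.3, (START, START, "HD"): 0.2,
    (START, "MO", "HD"): 0.7, (START, "MO", "OA"): 0.3,
    ("MO", "HD", "OA"): 0.6, ("MO", "HD", "MO"): 0.4,
}
vp_emit: Emit = {("MO", "ADV"): 0.8, ("MO", "NE"): 0.1,
                 ("HD", "VVPP"): 0.9, ("OA", "NE"): 0.5}

best = best_functions(["ADV", "VVPP", "NE"], ["MO", "HD", "OA"], vp_trans, vp_emit)
print(best)
```

With these toy tables the daughters ADV, VVPP, NE of the VP in figure 6 receive the functions MO, HD, OA.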

4.2 Serial Order 

As the annotation format permits trees with crossing branches, we need
a convention for determining the relative position of overlapping
sibling phrases in order to assign them a position in a Markov model.
For instance, in figure 6 the range of the terminal node positions of
VP overlaps with those of the subject SB and the finite verb HD. Thus
there is no single a-priori position for the VP node⁵.

The position of a phrase depends on the position 
of its descendants. We define the relative order of 
two phrases recursively as the order of their anchors, 
i.e., some specified daughter nodes. If the anchors 
are words, we simply take their linear order. 

The exact definition of the anchor is based on lin- 
guistic knowledge. We choose the most intuitive al- 
ternative and define the anchor as the head of the 
phrase (or some equivalent function). Noun phrases 
do not necessarily have a unique head; instead, we 
use the last element in the noun kernel (elements 
of the noun kernel are determiners, adjectives, and 
nouns) to mark the anchor position. Except for 
NPs, we employ a default rule that takes the left- 
most element as the anchor in case the phrase has 
no (unique) head. 

Thus the position of the VP in figure 6 is defined 
as equal to the string position of besucht. The po- 
sition of the VP node in figure 1 is equal to that of 
aufgegeben, and the position of the NP in figure 3 is 
equivalent to that of Bonusprogramm. 
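The anchor convention can be sketched as a short recursive function. The node representation below is hypothetical. Two points are assumptions of this sketch rather than statements of the paper: only the label HD is treated as a head ("or some equivalent function" is not modeled), and an NP without any NK daughter falls back to the leftmost-daughter rule. The edge labels given to the VP and to nie in the example are likewise assumed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    category: str                    # word or phrase category, e.g. "NE", "VP"
    function: str                    # edge label to the mother node, e.g. "HD"
    position: Optional[int] = None   # string position, set for words only
    daughters: List["Node"] = field(default_factory=list)


def anchor_position(node: Node) -> int:
    """String position that represents `node` in the serial order of its siblings."""
    if not node.daughters:           # a word: its own linear position
        return node.position
    if node.category == "NP":
        # last element of the noun kernel (edge label NK)
        kernel = [d for d in node.daughters if d.function == "NK"]
        if kernel:
            return max(anchor_position(d) for d in kernel)
    else:
        heads = [d for d in node.daughters if d.function == "HD"]
        if len(heads) == 1:          # unique head
            return anchor_position(heads[0])
    # default rule: leftmost daughter
    return min(anchor_position(d) for d in node.daughters)


# Figure 6: Selbst(0) besucht(1) hat(2) Peter(3) Sabine(4) nie(5)
selbst = Node("ADV", "MO", 0)
besucht = Node("VVPP", "HD", 1)
hat = Node("VAFIN", "HD", 2)
peter = Node("NE", "SB", 3)
sabine = Node("NE", "OA", 4)
nie = Node("ADV", "NG", 5)           # edge label assumed for illustration
vp = Node("VP", "OC", daughters=[selbst, besucht, sabine])  # edge label assumed
s = Node("S", "--", daughters=[hat, peter, nie, vp])

ordered = sorted(s.daughters, key=anchor_position)
print([d.category for d in ordered])
```

The VP is anchored at the position of besucht, so the daughters of S are ordered VP, VAFIN, NE, ADV even though the VP's terminals overlap with those of its siblings.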

4.3 Reliability 

Experience gained from the development of the Penn Treebank (Marcus et
al., 1994) has shown that automatic annotation is useful only if it is
absolutely correct, while wrong analyses are often difficult to detect
and their correction can be time-consuming.

⁵ Without crossing edges, the serial order of phrases is trivial:
phrase $Q_1$ precedes phrase $Q_2$ if and only if all terminal nodes
derived from $Q_1$ precede those of $Q_2$. This suffices to uniquely
determine the order of sibling nodes.

To prevent the human annotator from miss- 
ing errors, the tagger for grammatical functions is 
equipped with a measure for the reliability of its out- 
put. 

Given a sequence of categories, the tagger calculates the most probable
sequence of grammatical functions. In addition, it computes the
probabilities of the second-best functions of each daughter node. If
some of these probabilities are close to that of the best sequence, the
alternatives are regarded as equally suited: the most probable one is
not taken to be the sole winner, and the prediction is marked as
unreliable in the output of the tagger.

These unreliable predictions can be further classi- 
fied in that we distinguish "unreliable" sequences as 
opposed to "almost reliable" ones. 

The distance between two probabilities for the best and second-best
alternative, $p_{best}$ and $p_{second}$, is measured by their
quotient. The classification of reliability is based on thresholds. In
the current implementation we employ three degrees of reliability which
are separated by two thresholds $\theta_1$ and $\theta_2$: $\theta_1$
separates unreliable decisions from those considered almost reliable,
and $\theta_2$ marks the difference between almost and fully reliable
predictions.

Unreliable:

$$ \frac{p_{best}}{p_{second}} < \theta_1 $$

The probabilities of alternative assignments are within some small
specified distance. In this case, it is the annotator who has to
specify the grammatical function.

Almost reliable:

$$ \theta_1 \le \frac{p_{best}}{p_{second}} < \theta_2 $$

The probability of an alternative is within some larger distance. In
this case, the most probable function is displayed, but the annotator
has to confirm it.

Reliable:

$$ \frac{p_{best}}{p_{second}} \ge \theta_2 $$

The probabilities of all alternatives are much smaller than that of the
best assignment, thus the latter is assigned.

For efficiency, an extended Viterbi algorithm is used. Instead of
keeping track of the best path only (cf. (Rabiner, 1989)), we keep
track of all paths that fall into the range marked by the probability
of the best path and $\theta_2$, i.e., we keep track of all alternative
paths with probability $p_{alt}$ for which

$$ p_{alt} > \frac{p_{best}}{\theta_2} $$

Suitable values for $\theta_1$ and $\theta_2$ were determined
empirically (cf. section 6).
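The three-way classification and the pruning criterion can be sketched as follows. The default thresholds are the values reported in section 6.1. How exact boundary values are treated, and what happens when no alternative exists at all, are assumptions of this sketch, as are the function names.

```python
from typing import List, Sequence


def classify(p_best: float, p_second: float,
             theta1: float = 5.0, theta2: float = 100.0) -> str:
    """Classify a prediction by the quotient of best and second-best probability."""
    if p_second <= 0.0:
        return "reliable"            # assumption: no competing alternative
    ratio = p_best / p_second
    if ratio < theta1:
        return "unreliable"          # annotator has to specify the function
    if ratio < theta2:
        return "almost reliable"     # displayed, but must be confirmed
    return "reliable"                # assigned without confirmation


def alternatives_in_beam(path_probs: Sequence[float],
                         theta2: float = 100.0) -> List[float]:
    """Keep the paths whose probability exceeds p_best / theta2."""
    p_best = max(path_probs)
    return [p for p in path_probs if p > p_best / theta2]


print(classify(0.6, 0.3))            # quotient 2
print(classify(0.8, 0.05))           # quotient 16
print(classify(0.99, 0.001))         # quotient 990
print(alternatives_in_beam([0.5, 0.2, 0.004, 0.006]))
```

Only paths inside the beam can change the reliability class of a prediction, which is why paths outside it need not be tracked.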

5 Tagging Phrase Categories 

The second level of automation (cf. section 3) automates the
recognition of phrasal categories, and so frees the annotator from
typing phrase labels. The task is performed by an extension of the
tagger presented in the previous section, where different Markov models
for each category were introduced. At the first level, the annotator
determines the category of the current phrase, and the tool runs the
appropriate model to determine the edge labels.

To assign the phrase label automatically, we run 
all models in parallel. Each model assigns gram- 
matical functions and, more important for this step, 
a probability to the phrase. The model assigning 
the highest probability is assumed to be most ade- 
quate, and the corresponding label is assigned to the 
phrase. 

Formally, we calculate the phrase category $Q$ (and at the same time
the sequence of grammatical functions $G = G_1 \ldots G_k$) on the
basis of the sequence of daughters $T = T_1 \ldots T_k$ with

$$ \operatorname*{argmax}_{Q} \max_{G} P_Q(G \mid T). $$

This procedure is equivalent to a different view 
on the same problem involving one large (combined) 
Markov model that enables a very efficient calcula- 
tion of the maximum. 

Let $\mathcal{G}_Q$ be the set of all grammatical functions that can
occur within a phrase of type $Q$. Assume that these sets are pairwise
disjoint. One can easily achieve this property by indexing all used
grammatical functions with their associated phrases and, if necessary,
duplicating labels, e.g., instead of using HD, MO, ..., use the indexed
labels $\mathrm{HD}_{S}$, $\mathrm{HD}_{VP}$, $\mathrm{MO}_{NP}$, ...
This property makes it possible to determine a phrase category by
inspecting the grammatical functions involved.

When applied, the combined model assigns gram- 
matical functions to the elements of a phrase (not 
knowing its category in advance) . If transitions be- 
tween states representing labels with different in- 
dices are forced to zero probability (together with 
smoothing applied to other transitions), all labels as- 
signed to a phrase get the same index. This uniquely 
identifies a phrase category. 
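The indexing idea can be sketched in a few lines: labels carry the phrase category as an index, transitions between differently indexed labels are disallowed, and the category is read off the assigned labels. The string encoding with an underscore and all function names are assumptions made for this illustration.

```python
from typing import Iterable, Tuple


def index_label(function: str, category: str) -> str:
    """Attach the phrase category as an index, e.g. HD + VP -> 'HD_VP'."""
    return f"{function}_{category}"


def split_label(indexed: str) -> Tuple[str, str]:
    """Inverse of index_label: 'HD_VP' -> ('HD', 'VP')."""
    function, category = indexed.rsplit("_", 1)
    return function, category


def transition_allowed(prev: str, nxt: str) -> bool:
    """Transitions between labels with different indices get zero probability."""
    return split_label(prev)[1] == split_label(nxt)[1]


def phrase_category(indexed_labels: Iterable[str]) -> str:
    """Read the phrase category off a sequence of indexed function labels."""
    categories = {split_label(label)[1] for label in indexed_labels}
    if len(categories) != 1:
        raise ValueError("labels do not share a single index")
    return categories.pop()


labels = [index_label(f, "VP") for f in ("MO", "HD", "OA")]
print(labels, phrase_category(labels))
```

Because every path through the combined model stays within one index, decoding the functions and identifying the category become a single search.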

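The two additional conditions

$$ G \in \mathcal{G}_{Q_1} \Rightarrow G \notin \mathcal{G}_{Q_2} \quad (Q_1 \ne Q_2) $$

and

$$ G_1 \in \mathcal{G}_{Q} \wedge G_2 \notin \mathcal{G}_{Q} \Rightarrow P(G_2 \mid G_1) = 0 $$

are sufficient to calculate

$$ \operatorname*{argmax}_{G} P(G \mid T) $$

using the Viterbi algorithm and to identify both the phrase category
and the respective grammatical functions.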

Again, as described in section 4, we calculate 
probabilities for alternative candidates in order to 
get reliability estimates. 

The overall accuracy of this approach is approx. 
95%, and higher if we only consider the reliable 
cases. Details about the accuracy are reported in 
the next section. 

6 Experiments 

To investigate the possibility of automating annotation, experiments
were performed with the cleaned part of the treebank⁶ (approx. 1,200
sentences, 24,000 words). The first run of experiments was carried out
to test tagging of grammatical functions, the second run to test
tagging of phrase categories.

6.1 Grammatical Functions 

This experiment tested the reliability of assigning 
grammatical functions given the category of the 
phrase and the daughter nodes (supplied by the an- 
notator) . 

Let us consider the sentence in figure 6: two sequences of grammatical
functions are to be determined, namely the grammatical functions of the
daughter nodes of S and VP. The information given for selbst besucht
Sabine is its category (VP) and the daughter categories: adverb (ADV),
past participle (VVPP), and proper noun (NE). The task is to assign the
functions modifier (MO) to ADV, head (HD) to VVPP and direct
(accusative) object (OA) to NE. Similarly, function tags are assigned
to the components of the sentence (S).

The tagger described in section 4 was used. 

The corpus was divided into two disjoint parts, 
one for training (90% of the respective corpus), and 
one for testing (10%). This procedure was repeated 
10 times with different partitions. Then the average 
accuracy was calculated. 
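The evaluation protocol can be sketched as below. The paper does not say whether the ten test parts are disjoint; this sketch assumes a ten-fold scheme in which every sentence is used for testing exactly once. The `evaluate` callback, which would train a tagger on one part and return its accuracy on the other, is hypothetical.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple, TypeVar

S = TypeVar("S")


def partitions(sentences: Sequence[S], folds: int = 10,
               seed: int = 0) -> Iterator[Tuple[List[S], List[S]]]:
    """Yield (train, test) pairs with a 90%/10% split for folds=10."""
    order = list(sentences)
    random.Random(seed).shuffle(order)
    for k in range(folds):
        test = order[k::folds]
        train = [s for i, s in enumerate(order) if i % folds != k]
        yield train, test


def average_accuracy(sentences: Sequence[S],
                     evaluate: Callable[[List[S], List[S]], float],
                     folds: int = 10) -> float:
    """Average the accuracy returned by `evaluate` over all partitions."""
    scores = [evaluate(train, test) for train, test in partitions(sentences, folds)]
    return sum(scores) / len(scores)


corpus = list(range(1200))           # stand-in for 1,200 annotated sentences
sizes = [(len(train), len(test)) for train, test in partitions(corpus)]
print(sizes[0])
```

For the 1,200 cleaned sentences this gives ten runs with 1,080 training and 120 test sentences each.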

The thresholds for search beams were set to $\theta_1 = 5$ and
$\theta_2 = 100$, i.e., a decision is classified as reliable if there
is no alternative with a probability larger than $\frac{1}{100}$ of the
best function tag. The prediction is classified as unreliable if the
probability of an alternative is larger than $\frac{1}{5}$ of the most
probable tag. If there is an alternative between these two thresholds,
the prediction is classified as almost reliable and marked in the
output (cf. section 4.3: marked assignments are to be confirmed by the
annotator, unreliable assignments are deleted, annotation is left to
the annotator).

⁶ The corpus is part of the German newspaper text provided on the ECI
CD-ROM. It has been part-of-speech tagged and manually corrected
previously, cf. (Thielen and Schiller, 1995).

Table 1: Levels of reliability and the percentage of cases where the
tagger assigned a correct grammatical function (or would have assigned
if a decision is forced).

              cases   correct
reliable       89%     96.7%
marked          7%     84.3%
unreliable      4%     57.3%
overall       100%     94.2%

Table 1 shows tagging accuracy depending on the 
three different levels of reliability. The results con- 
firm the choice of reliability measures: the lower the 
reliability, the lower the accuracy. 

Table 2 shows tagging accuracy depending on the category of the phrase
and the level of reliability. The table contains the following
information: the number of all mother-daughter relations (i.e., number
of words and phrases which are immediately dominated by a mother node
of a particular category), the overall accuracy for that phrasal
category and the accuracies for the three reliability intervals.

6.2 Error Analysis for Function 
Assignment 

The inspection of tagging errors reveals several sources of wrong
assignments. Table 3 shows the 10 most frequent errors⁷, which
constitute 25% of all errors (1509 errors occurred during 10 test
runs).

Read the table in the following way: line 2 shows the second-most
frequent error. It concerns NPs occurring in a sentence (S); this
combination occurred 1477 times during testing. In 286 of these
occurrences the NP is assigned the grammatical function OA (accusative
object) manually, but of these 286 cases the tagger assigned the
function SB (subject) 56 times.
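This reading of Table 3 amounts to one ratio per line: the number of wrong tagger assignments divided by the number of manual assignments of the original function. A small sketch with the figures from Table 3 (the tuple layout and names are chosen here for illustration):

```python
# (mother, daughter, pair frequency, manual tag, manual frequency,
#  tagger tag, number of wrong assignments), copied from Table 3
TABLE_3 = [
    ("S", "NP", 1477, "SB", 894, "OA", 65),
    ("S", "NP", 1477, "OA", 286, "SB", 56),
    ("NP", "PP", 470, "PG", 52, "MNR", 50),
    ("S", "VP", 613, "PD", 47, "OC", 42),
    ("PP", "PP", 252, "PG", 30, "MNR", 30),
    ("VP", "NP", 286, "DA", 32, "OA", 26),
    ("S", "NP", 1477, "PD", 72, "SB", 25),
    ("S", "NP", 1477, "MO", 33, "SB", 21),
    ("S", "S", 186, "MO", 78, "PD", 21),
    ("VP", "PP", 453, "SBP", 21, "MO", 21),
]
TOTAL_ERRORS = 1509  # errors during the 10 test runs


def confusion_rate(row: tuple) -> float:
    """Share of the manually assigned function that the tagger got wrong this way."""
    return row[6] / row[4]


listed_errors = sum(row[6] for row in TABLE_3)
for row in TABLE_3[:2]:
    print(f"{row[1]} in {row[0]}: {row[3]} -> {row[5]} in {confusion_rate(row):.1%} of cases")
print(f"listed errors: {listed_errors} of {TOTAL_ERRORS}")
```

For line 2, for example, 56 of the 286 accusative objects were tagged as subjects, i.e., roughly one in five.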

The errors fall into the following classes: 

1. There is insufficient information in the node la- 
bels to disambiguate the grammatical function. 

Line 1 is an example for insufficient information. The tag NP is
uninformative about its case and therefore the tagger has to
distinguish SB (subject) and OA (accusative object) on the basis of its
position, which is not very reliable in German. Missing information in
the labels is the main source of errors. Therefore, we are currently
investigating the benefits of a morphological component and percolation
of selected information to parent nodes.

⁷ See appendix A for a description of tags used in the table.

Table 2: Tagging accuracy for assigning grammatical functions depending
on the category of the mother node. For each category the first row
shows the percentage of branches that occur within this category and
the overall accuracy, the following rows show the relative percentage
and accuracy for different levels of reliability.

                    cases    correct
S                    26%      89.1%
    decision         85%      92.7%
    marked           [illegible in source]
    no decision       7%      52.9%
VP                    7%      90.9%
    decision         97%      92.2%
    marked            1%      [illegible in source]
    no decision       2%      52.3%
NP                   26%      96.4%
    decision         86%      98.6%
    marked           10%      86.8%
    no decision       4%      73.0%
PP                   24%      97.9%
    decision         92%      99.2%
    marked            6%      85.8%
    no decision       2%      75.5%
others               18%      94.7%
    decision         91%      98.0%
    marked            6%      82.8%
    no decision       3%      22.1%

Table 3: The 10 most frequent errors in assigning grammatical
functions. The table shows a mother and a daughter node category, the
frequency of this particular combination (sum over 10 test runs), the
grammatical function assigned manually (and its frequency) and the
grammatical function assigned by the tagger (and its frequency).

      phrase   elem      f    original       assigned
 1.   S        NP     1477    SB     894     OA      65
 2.   S        NP     1477    OA     286     SB      56
 3.   NP       PP      470    PG      52     MNR     50
 4.   S        VP      613    PD      47     OC      42
 5.   PP       PP      252    PG      30     MNR     30
 6.   VP       NP      286    DA      32     OA      26
 7.   S        NP     1477    PD      72     SB      25
 8.   S        NP     1477    MO      33     SB      21
 9.   S        S       186    MO      78     PD      21
10.   VP       PP      453    SBP     21     MO      21

Table 4: Levels of reliability and the percentage of cases in which the
tagger assigned a correct phrase category (or would have assigned if a
decision is forced).

              cases   correct
reliable       79%     98.5%
marked         16%     90.4%
unreliable      5%     65.9%
overall       100%     95.4%

2. Due to the n-gram approach, the tagger only 
sees a local window of the sentences. 

Some linguistic knowledge is inherently global, e.g., 
there is at most one subject in a sentence and one 
head in a VP. Errors of this type may be reduced by 
introducing finite state constraints that restrict the 
possible sequences of functions within each phrase. 
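Such constraints could take many forms. The sketch below is not a finite state implementation; it is a simple occurrence-count filter that enforces the two example constraints named above on candidate function sequences. The constraint table and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Hypothetical constraint table: maximum number of occurrences of a
# grammatical function within one phrase of the given category.
MAX_OCCURRENCES: Dict[str, Dict[str, int]] = {
    "S": {"SB": 1},    # at most one subject in a sentence
    "VP": {"HD": 1},   # at most one head in a VP
}


def satisfies_constraints(category: str, functions: Sequence[str]) -> bool:
    """True if no function occurs more often than the table allows."""
    counts = Counter(functions)
    limits = MAX_OCCURRENCES.get(category, {})
    return all(counts[f] <= limit for f, limit in limits.items())


def filter_candidates(category: str,
                      candidates: Sequence[Sequence[str]]) -> List[Sequence[str]]:
    """Drop candidate function sequences that violate a constraint."""
    return [c for c in candidates if satisfies_constraints(category, c)]


candidates = [["SB", "HD", "SB"], ["OA", "HD", "SB"]]
print(filter_candidates("S", candidates))
```

A filter of this kind would remove, for instance, an S analysis in which two NPs are both tagged SB, which the local trigram window cannot rule out on its own.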

3. The manual annotation is wrong, and a correct 
tagger prediction is counted as an error. 

At earlier stages of annotation, the main source of 
errors was wrong or missing manual annotation. In 
some cases, the tagger was able to abstract from 
these errors during the training phase and subse- 
quently assigned the correct tag for the test data. 
However, when performing a comparison against the 
corpus, these differences are marked as errors. Most 
of these errors were eliminated by comparing two 
independent annotations and cleaning up the data. 

6.3 Phrase Categories 

In this experiment, the reliability of assigning phrase 
categories given the categories of the daughter nodes 
(they are supplied by the annotator) was tested. 

Consider the sentence in figure 6: two phrase categories are to be
determined (VP and S). The information given for selbst besucht Sabine
is the sequence of categories: adverb (ADV), past participle (VVPP),
and proper noun (NE). The task is to assign category VP. Subsequently,
S is to be assigned based on the categories of the daughters VP, VAFIN,
NE, and ADV.

The extended tagger using a combined model as 
described in section 5 was applied. 

Again, the corpus was divided into two disjoint parts, one for training
(90% of the corpus), and one for testing (10%). The procedure was
repeated 10 times with different partitions. Then the average accuracy
was calculated.

Table 5: Tagging accuracy for assigning phrase categories, depending on
the manually assigned category. For each category, the first row shows
the percentage of phrases belonging to a specific category (according
to manual assignment) and the percentage of correct assignments. The
following rows show the relative percentage and accuracy for different
levels of reliability.

                    cases    correct
S                    20%      97.5%
    decision         96%      99.7%
    marked            2%      63.2%
    no decision       2%      29.0%
VP                    9%      93.2%
    decision         71%      96.4%
    marked           24%      91.3%
    no decision       5%      60.9%
NP                   29%      96.1%
    decision         81%      99.3%
    marked           13%      91.8%
    no decision       6%      64.9%
PP                   24%      98.7%
    decision         94%      99.6%
    marked            4%      92.5%
    no decision       2%      70.8%
others               18%      89.0%
    decision         42%      91.7%
    marked           45%      90.6%
    no decision      12%      73.2%

The same thresholds for search beams as for the 
first set of experiments were used. 

Table 4 shows tagging accuracy depending on the 
three different levels of reliability. 

Table 5 shows tagging accuracy depending on the 
category of the phrase and the level of reliability. 
The table contains the following information: the 
percentage of occurrences of the particular phrase, 
the overall accuracy for that phrasal category and 
the accuracy for each of the three reliability inter- 
vals. 

6.4 Error Analysis for Category 
Assignment 

When forced to make a decision (even in unreliable cases), 435 errors
occurred during the 10 test runs (4.5% error rate). Table 6 shows the
10 most frequent errors, which constitute 50% of all errors.

The most frequent error was the confusion of S 
and VP. They differ in that sentences S contain fi- 
nite verbs and verb phrases VP contain non-finite 
verbs. But the tagger is trained on data that con- 
tain incomplete sentences and therefore sometimes 
erroneously assumes an incomplete S instead of a 
VP. To avoid this type of error, the tagger should 
be able to take the neighborhood of phrases into ac- 
count. Then, it could detect the finite verb that 
completes the sentence. 

Adjective phrases AP and noun phrases NP are 
confused by the tagger (line 5 in table 6), since al- 
most all AP's can be NP's. This error could also 
be fixed by inspecting the context and detecting the 
associated NP. 

As for assigning grammatical functions, insufficient information in the
labels is a significant source of errors, cf. the second-most frequent
error. A large number of cardinal-noun pairs forms a numerical
component (NM), like 7 Millionen, 50 Prozent, etc. (7 million, 50
percent). But this combination also occurs in NPs like 20 Leute, 3
Monate, ... (20 people, 3 months), which are mis-tagged since they are
less frequent. This can be fixed by introducing an extra tag for nouns
denoting numericals.

7 Conclusion 

A German newspaper corpus is currently being an- 
notated with a new annotation scheme especially de- 
signed for free word order languages. 

Two levels of automatic annotation (level 1: assigning grammatical
functions and level 2: assigning phrase categories) have been presented
and evaluated in this paper.

Table 6: The 10 most frequent errors in assigning phrase categories
(summed over reliability levels). The table shows the phrase category
assigned manually (and its frequency) and the category erroneously
assigned by the tagger (and its frequency).

      phrase      f    assigned     f
 1.   VP        828    S           46
 2.   NP       2812    NM          32
 3.   NP       2812    PP          31
 4.   NP       2812    S           25
 5.   AP        419    NP          15
 6.   DL         20    CS          15
 7.   PP       2298    NP          15
 8.   S        1910    NP          15
 9.   AP        419    PP          11
10.   MPN       293    NP          11

The overall accuracy for assigning grammatical functions is 94.2%,
ranging from 89% to 98%, depending on the type of phrase. The lowest
accuracy is achieved for sentences, the highest for prepositional
phrases. By suppressing unreliable decisions, precision can be
increased to range from 92% to 99%.

The overall accuracy for assigning phrase categories is 95.4%, ranging
from 89% to 99%, depending on the category. By suppressing unreliable
decisions, precision can also be increased to range from 92% to over
99%.

In the error analysis, the following sources of misinterpretation could
be identified: insufficient linguistic information in the nodes (e.g.,
missing case information), and insufficient information about the
global structure of phrases (e.g., missing valency information).
Morphological information in the tagset, for example, helps to identify
the objects and the subject of a sentence. Using a more fine-grained
tagset, however, requires methods for adjusting the granularity of the
tagset to the size (and coverage) of the corpus, in order to cope with
the sparse data problem.

Appendix A

This section contains descriptions of tags used in this paper. These
are not complete lists.

A.1 Part-of-Speech Tags

We use the Stuttgart-Tübingen Tagset. The complete set is described in
(Thielen and Schiller, 1995).

ADJA      attributive adjective
ADJD      adverbial adjective
ADV       adverb
APPR      preposition
ART       article
CARD      cardinal number
FM        foreign material
KOKOM     comparing conjunction
KOUS      sub-ordinating conjunction
NE        proper noun
NN        common noun
PIAT      indefinite pronoun
PPER      personal pronoun
PTKNEG    negation
VAFIN     finite auxiliary
VMFIN     finite modal verb
VVPP      past participle of main verb

A.2 Phrasal Categories

AP        adjective phrase
CS        coordination of sentences
DL        discourse level
MPN       multi-word proper noun
NM        multi-token numerical
NP        noun phrase
PP        prepositional phrase
S         sentence
VP        verb phrase

A.3 Grammatical Functions

AC        adpositional case marker
CJ        conjunct
DA        dative
HD        head
JU        junctor
MNR       post-nominal modifier
MO        modifier
NG        negation
NK        noun kernel
OA        accusative object
OC        clausal object
PD        predicative
PG        pseudo genitive
PNC       proper noun component
SB        subject
SBP       passivized subject
SP        subject or predicative